
v0.2.5: split BPF datapath into fast_path + finalize via bpf_tail_call (#45)

Merged

lunarthegrey merged 4 commits into main from v0.2.5-tail-call-finalize on May 5, 2026
Conversation

@lunarthegrey
Contributor

Summary

Fixes the v0.2.4 stack-budget regression on UniFi 5.15 kernels by splitting the BPF datapath into two programs connected by bpf_tail_call, so each program gets its own 512-byte stack budget. The same architecture establishes the pattern for future fast-path-internal stages (additional packet transforms, more sophisticated FIB logic) without re-bisecting stack bytes every release.

The proximate failure (reported on edge1-mci1-net): UniFi's 5.15.72-ui-cn9670 aarch64 kernel rejected v0.2.4's fast_path BPF program with "combined stack size of 3 calls is 544. Too large". The same bytecode loaded cleanly on CI's qemu 5.15 vanilla x86_64 (stack depth 0+360+0+0); UniFi's BPF patches plus the aarch64 JIT account for stack usage ~120 bytes higher than vanilla.

Architecture

                       packet ingress
                              │
                              ▼
   ┌──────────────────────────────────────────────────┐
   │ fast_path  (XDP, attached to eth0..ethN)         │  Frame A
   │   classification (allow-prefix, block-prefix)     │  fits 512B
   │   FIB lookup  (kernel-fib | custom-fib | compare)│
   │   devmap pre-check                                │
   │   TTL decrement / L2 rewrite (in-place)           │
   │   write per-CPU MUTATION_CTX                      │
   │   bpf_tail_call(MUTATION_PROGS, 0)  ──────────┐  │
   └────────────────────────────────────────────────│──┘
                                                    │
                                                    ▼
   ┌──────────────────────────────────────────────────┐
   │ finalize  (XDP, tail-called by fast_path)        │  Frame B
   │   read MUTATION_CTX                               │  fresh 512B
   │   mss-clamp lookup + (optional) MSS rewrite       │
   │   VLAN choreography (push / pop / rewrite)        │
   │   bpf_redirect_map(egress_ifindex)                │
   └──────────────────────────────────────────────────┘

mss-clamp + VLAN + redirect move from forward_success into finalize. Per-prefix LPM keys + TCP-options walk live in finalize's fresh stack budget. fast_path's responsibilities shrink to classification + L2/TTL, which fits comfortably under any kernel's accounting.

This is not the multi-module dispatcher (SPEC §3.4 / §5.0); that's for chaining independent modules at the same hook (ddos in front of fast-path, sampler behind it). Tail-call is for splitting one logical pipeline. Both will eventually exist; v0.2.5 ships only the tail-call split.

What's in the PR

  • New BPF program (crates/modules/fast-path/bpf/src/finalize.rs): #[xdp] pub fn finalize reads MUTATION_CTX, then runs mss-clamp + VLAN + redirect. ~280 LOC; mss-clamp + VLAN choreography moved here verbatim from main.rs.
  • BPF maps (maps.rs): new MutationCtx struct (16 bytes), MUTATION_CTX (PerCpuArray, single-element scratch), MUTATION_PROGS (ProgramArray, 8 slots). New StatIdx 35/36 (err_tail_call, err_mutation_ctx).
  • BPF program (main.rs): forward_success writes MutationCtx and tail-calls into MUTATION_PROGS[0] instead of doing mss-clamp + VLAN + redirect inline. ~440 LOC of mss-clamp + VLAN choreography moved out (now in finalize.rs).
  • Userspace lifecycle (linux_impl.rs): attach() loads finalize first → populates MUTATION_PROGS[0] with finalize's FD → loads + attaches fast_path. Order matters. New populate_mutation_progs helper; new tail_call_chain_from_pin for status reporting.
  • Pin lifecycle (pin.rs): new FINALIZE_PROGRAM_NAME constant + PROGRAM_NAMES array. MAP_NAMES grows to 19 (adds MSS_CLAMP_V4/V6/BY_IFACE, which were missing from v0.2.4, plus MUTATION_CTX/PROGS). pin_program_and_maps walks both program names.
  • CLI status (loader.rs): new "tail-call chain" section reports MUTATION_PROGS[0] occupancy. Stat names array updated for indices 33-36 (mss-clamp + tail-call diagnostics), bringing it in sync with what's actually defined.
  • Test harness (tests/common/mod.rs): Harness::new now loads both programs and populates MUTATION_PROGS[0] before returning. bpf_prog_test_run follows tail-calls (the kernel re-enters its dispatcher for the target program), so existing tests transparently see the full chain's verdict + mutations.
  • Docs (README.md, new docs/runbooks/tail-call-architecture.md): Status table row for v0.2.5+; the new runbook covers chain topology, the MutationCtx wire format, debug commands (bpftool prog show, bpftool map dump MUTATION_PROGS), and how future stages slot in.
  • Version (Cargo.toml, VERSION, README install snippets): 0.2.4 → 0.2.5.
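For reference, the field names carried across the tail-call boundary are given in the commit message (egress_ifindex, egress_vid, ingress_vid, ip_offset, is_v4) and the PR states the struct is 16 bytes; the exact types, ordering, and padding below are assumptions in this sketch, not the committed layout:

```rust
// Hypothetical MutationCtx layout: field names and the 16-byte size come from
// the PR; types, order, and explicit padding are guesses for illustration.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct MutationCtx {
    pub egress_ifindex: u32, // devmap key used by finalize's bpf_redirect_map
    pub ip_offset: u32,      // offset of the IP header from packet start (14 or 18)
    pub egress_vid: u16,     // VLAN to push/rewrite on egress (0 = none)
    pub ingress_vid: u16,    // VLAN seen on ingress, for pop/rewrite decisions
    pub is_v4: u8,           // 1 = IPv4, 0 = IPv6
    pub _pad: [u8; 3],       // explicit padding to reach the stated 16 bytes
}
```

Because the struct crosses a per-CPU map between two BPF programs, #[repr(C)] with explicit padding keeps the wire format stable regardless of compiler field reordering.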

What's deliberately NOT in this PR

  • Netns end-to-end integration test (tests/tail_call.rs was in the plan). The kernel BPF_PROG_TEST_RUN harness already exercises the tail-call via the existing fixtures (now updated to populate MUTATION_PROGS[0]). A real-veth + AF_PACKET capture test would be good additional coverage, but it's ~150 LOC of test infra and was deferred to keep this PR focused on the architecture change. Will land as a follow-up.
  • bpf_fib_lookup per-CPU map move (mentioned as "alternative B" in earlier discussion). Tail-call obviates it; we have plenty of stack headroom now.
  • Multi-stage chain (slots 1-7 of MUTATION_PROGS). Reserved capacity is in place for future use; no actual stages today.
  • The dispatcher (SPEC §5.0). Different problem entirely — for ddos / sampler / randomizer composition.

CI expectations

Existing CI matrix should pass:

  • ✅ fmt + clippy + test (workspace lib tests pass on macOS dev: 94 + 40)
  • ✅ Cross-build matrix (4 targets) — userspace-only, no BPF concern
  • ✅ qemu-verifier 5.15 + 6.6 — should load both programs, attach succeeds, sudo-gated test fixtures pass through the tail-call chain

The qemu jobs are the meaningful test for "does the verifier accept this on real kernels." If they pass, vanilla 5.15 / 6.6 are fine. UniFi-style stricter accounting will be confirmed via post-merge deployment on the same router that hit the v0.2.4 regression.

Test plan

Pre-merge (CI):

  • cargo fmt --all --check: clean
  • cargo clippy --workspace --all-targets --all-features -- -D warnings: clean
  • cargo test --workspace --lib: 94 + 40 tests pass on macOS dev host
  • CI fmt+clippy+test passes
  • CI cross-build (4 targets) passes
  • CI qemu-verifier 5.15 + 6.6 passes (with updated harness following the chain)

Post-merge on the deployed UniFi router:

  • apt install ./packetframe_0.2.5_arm64.deb. sudo systemctl restart packetframe.
  • sudo packetframe feasibility --config /etc/packetframe/packetframe.conf --human — all xdp.attach.ethN now PASS (no more 544/512 rejection).
  • sudo packetframe status — the new "tail-call chain" section reports MUTATION_PROGS[0]: populated (finalize).
  • Add a per-prefix mss-clamp directive: mss-clamp 23.191.200.0/24 1360. sudo packetframe reconfigure.
  • tcpdump -i eth2 -n 'tcp[tcpflags] & tcp-syn != 0' -vv confirms wire MSS=1360 on outbound SYNs.
  • sudo packetframe status | grep mss_clamp_applied shows the counter climbing.
  • err_tail_call and err_mutation_ctx stay at 0.

Tag flow after merge

git checkout main && git pull
git tag -a v0.2.5 -m "v0.2.5"
git push origin v0.2.5

(The version bump is in this PR — same pattern as v0.2.4 — so just tag and push.)

🤖 Generated with Claude Code

lunarthegrey and others added 4 commits May 4, 2026 22:38
Fixes the v0.2.4 regression on UniFi 5.15.72-ui-cn9670 (aarch64) where
the kernel rejected fast_path with "combined stack size of 3 calls is
544. Too large" — same bytecode that loaded cleanly on CI's qemu 5.15
(stack 0+360+0+0). UniFi's BPF patches plus aarch64 JIT account stack
~120 bytes higher than vanilla 5.15 on x86_64.

Architecture: two XDP programs in one ELF, chained by bpf_tail_call.
Each gets its own 512-byte stack budget.

  fast_path (XDP, attached per-iface):
    classification (allow-prefix, block-prefix, dry-run)
    FIB lookup (kernel-fib | custom-fib | compare)
    devmap pre-check
    TTL decrement (in-place)
    L2 rewrite (in-place)
    write per-CPU MUTATION_CTX
    bpf_tail_call(MUTATION_PROGS, 0) ────────► finalize (XDP, tail-called):
                                                 read MUTATION_CTX
                                                 mss-clamp lookup + mutation
                                                 VLAN choreography
                                                 bpf_redirect_map

mss-clamp + VLAN + redirect move from forward_success into the new
finalize program; per-prefix LPM keys + TCP-options walk live in
finalize's fresh stack budget. fast_path's responsibilities shrink to
classification + L2/TTL mutation, which fits comfortably under any
kernel's accounting.

This is NOT the multi-module dispatcher (SPEC §3.4 / §5.0). Tail-call
is one-way control transfer between cooperating stages of one logical
pipeline; the dispatcher is for chaining independent modules at the
same hook (ddos, sampler). Both will eventually exist; v0.2.5 ships
only the former.

New BPF maps:
* MUTATION_CTX (PerCpuArray<MutationCtx>): per-CPU scratch carrying
  egress_ifindex, egress_vid, ingress_vid, ip_offset, is_v4 across the
  tail-call boundary. fast_path writes, finalize reads.
* MUTATION_PROGS (ProgramArray, 8 slots): jump table. Slot 0 holds
  finalize today; slots 1-7 reserved for future stages.

New StatIdx counters (append-only):
* 35: err_tail_call — fast_path's tail_call returned an error (slot
  empty). fast_path falls through to XDP_PASS so traffic still flows.
* 36: err_mutation_ctx — finalize couldn't read MUTATION_CTX. Should
  be 0 in steady state.
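
The two counter names and indices above are from the PR; the enum shape below is a hypothetical sketch of what an append-only StatIdx might look like (earlier variants elided, repr assumed):

```rust
// Sketch only: indices 35/36 and their names are from the PR; the repr and
// the elided earlier variants are assumptions.
#[repr(u32)]
pub enum StatIdx {
    // ... indices 0-34 elided ...
    ErrTailCall = 35,    // tail_call failed (empty slot); fast_path falls back to XDP_PASS
    ErrMutationCtx = 36, // finalize could not read MUTATION_CTX; should stay 0
}
```

Explicit discriminants keep the stat array append-only: userspace and BPF agree on indices even if variants are later inserted in source order.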

Userspace lifecycle changes:
* attach() loads finalize first, populates MUTATION_PROGS[0], then
  loads + attaches fast_path. Order matters: fast_path's first packet
  must find a populated slot.
* pin_program_and_maps walks PROGRAM_NAMES (fast_path + finalize); both
  pins survive SIGTERM per SPEC §8.5.
* MAP_NAMES grows to include MUTATION_CTX, MUTATION_PROGS, and the
  v0.2.4 mss-clamp maps that were missing from the previous list.
* Status command reports tail-call chain occupancy ("MUTATION_PROGS[0]:
  populated (finalize)") so operators can confirm wiring.

Test harness:
* Harness::new() now loads both programs and populates MUTATION_PROGS[0]
  before returning, so existing bpf_prog_test_run-based tests follow
  the chain transparently. Kernel's BPF_PROG_TEST_RUN handles
  bpf_tail_call by re-entering its dispatcher for the target program;
  tests see the verdict + mutations from the full chain.

Version bumped 0.2.4 → 0.2.5. README Status table grows a "Two-stage
BPF datapath" row. New runbook at docs/runbooks/tail-call-architecture.md
documents the chain, MutationCtx wire format, debug commands, and
how future stages slot in.

Netns end-to-end integration test (real veth + SYN + capture, asserts
MSS clamped on the wire) is deferred to a follow-up PR. Existing
attach-roundtrip + bpf_prog_test_run fixtures in qemu-verifier validate
that both programs LOAD + attach + the tail-call wires correctly on
kernels 5.15 + 6.6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The newer-kernel verifier on the GitHub Actions runner rejected the
single-entry mss_clamp_inline at the proto-byte read with
`R9 offset is outside of the packet`. The bound check used a
runtime-conditional size (`if is_v4 { 20 } else { 40 }`), which the
verifier could not connect to the subsequent typed cast through
`*const Ipv4Hdr` — so the read at offset 9 (proto field) appeared
unbounded.

Splitting the dispatch upfront lets each path bound-check with a
compile-time constant (`Ipv4Hdr::LEN` / `Ipv6Hdr::LEN`) immediately
followed by the cast and field reads — the same `ptr_at` pattern
main.rs already uses. The qemu kernels (5.15, 6.6) accepted the old
form; the newer runner kernel did not.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
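The fix described above can be modeled in plain Rust: bound-check with a compile-time constant immediately before the typed access, one branch per IP version, instead of a single runtime-conditional bound. This is a userspace sketch of the pattern, not the actual BPF code; function names and the header stubs are illustrative:

```rust
// Per-version header lengths as compile-time constants, mirroring
// Ipv4Hdr::LEN / Ipv6Hdr::LEN in the BPF code.
const IPV4_HDR_LEN: usize = 20;
const IPV6_HDR_LEN: usize = 40;

// ptr_at-style helper: constant-size bound check right before the access,
// so the (modeled) verifier can tie the check to the subsequent read.
fn ptr_at(data: &[u8], offset: usize, len: usize) -> Option<&[u8]> {
    if offset.checked_add(len)? > data.len() {
        return None; // out of packet: bail instead of reading
    }
    Some(&data[offset..offset + len])
}

// Dispatch split upfront: each arm uses its own constant bound, then reads
// the proto byte (offset 9 in IPv4) or next-header (offset 6 in IPv6).
fn l4_proto(data: &[u8], ip_offset: usize, is_v4: bool) -> Option<u8> {
    if is_v4 {
        let hdr = ptr_at(data, ip_offset, IPV4_HDR_LEN)?;
        Some(hdr[9])
    } else {
        let hdr = ptr_at(data, ip_offset, IPV6_HDR_LEN)?;
        Some(hdr[6])
    }
}
```

The rejected form did `let len = if is_v4 { 20 } else { 40 };` and checked once with the runtime value; splitting the branches gives each read a constant the verifier can propagate.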
The verifier's `find_good_pkt_pointers` refuses to propagate readable-
range info through packet-pointer arithmetic when the scalar offset's
umax_value exceeds MAX_PACKET_OFF (0xffff). `mctx.ip_offset` is read
from a per-CPU map, so the verifier sees its full u32 range and skips
range propagation — leaving the post-bound-check pkt pointer with
range=0 and rejecting the subsequent header field read.

Capping `ip_offset` at MAX_IP_OFFSET (64) right after the MUTATION_CTX
read gives the verifier a tight umax it can reason about. fast_path
writes 14 or 18 in practice; 64 leaves headroom for a future second
VLAN tag. Out-of-range is fail-safe XDP_PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
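The cap can be modeled as a tiny pure function: clamp the map-provided offset to a tight bound right after the MUTATION_CTX read, so the verifier sees a small umax. The constant and the 14/18 values are from the commit message; the helper name is illustrative:

```rust
// fast_path writes 14 (plain Ethernet) or 18 (one VLAN tag) in practice;
// 64 leaves headroom for a future second VLAN tag.
const MAX_IP_OFFSET: u32 = 64;

// Validate the offset read from the per-CPU map. After this check the
// (modeled) verifier knows umax_value = 64, small enough for packet-pointer
// range propagation. None maps to the fail-safe XDP_PASS in the real program.
fn checked_ip_offset(raw: u32) -> Option<u32> {
    if raw > MAX_IP_OFFSET { None } else { Some(raw) }
}
```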
Capping ip_offset at 64 (previous commit) got the verifier past the IP
header reads but the TCP csum patch still hit "R6 offset is outside of
the packet" at byte 17 of the TCP header. The bound check on
`start + csum_off + 2 > end` did not propagate readable-range back to
the actual read site because LLVM emitted a fresh packet-pointer
arithmetic chain (new id) for the read.

v0.2.4's working pattern derived `ip_offset = (ip as usize) - start`
inside mss_clamp_inline, where the verifier tracks the result as a
`pkt - pkt` subtraction with `umax = MAX_PACKET_OFF (0xffff)` — a
pkt-derived bound that range propagation honors. Pulling the same
pattern into finalize: pass the typed `ip` pointer (already
bounds-checked) into `mss_clamp_tcp` and recover ip_offset there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
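The pkt - pkt derivation described above can be sketched in userspace Rust: rather than threading a map-derived scalar offset to the read site, pass the already-bounds-checked header slice and recover the offset from pointer subtraction, which the verifier tracks as a pkt-derived bound. Names here are illustrative, not the actual finalize.rs code:

```rust
// Recover ip_offset inside the helper from two pointers into the same
// packet buffer: a pkt - pkt subtraction the verifier's range propagation
// honors (umax = MAX_PACKET_OFF), unlike a raw value read from a map.
fn recover_ip_offset(data: &[u8], ip_hdr: &[u8]) -> usize {
    (ip_hdr.as_ptr() as usize) - (data.as_ptr() as usize)
}
```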
lunarthegrey merged commit 04bf598 into main May 5, 2026 (10 checks passed)
lunarthegrey deleted the v0.2.5-tail-call-finalize branch May 5, 2026 04:49
lunarthegrey mentioned this pull request May 5, 2026 (4 tasks)